Improving cache locality for GPU-based volume rendering

نویسندگان

  • Yuki Sugimoto
  • Fumihiko Ino
  • Kenichi Hagihara
چکیده

We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Cache Locality for Ray Casting with CUDA

In this paper, we present an acceleration method for texture-based ray casting on the compute unified device architecture (CUDA) compatible graphics processing unit (GPU). Since ray casting is a memory-intensive application, our method increases the hit rate of the texture cache during rendering. To achieve this, our method dynamically selects the width and height of thread blocks (TBs) such th...

متن کامل

Improving Inter-thread Data Sharing with GPU Caches

The massive amount of fine-grained parallelism exposed by a GPU program makes it difficult to exploit shared cache benefits even there is good program locality. The non deterministic feature of thread execution in the bulk synchronize parallel (BSP) model makes the situation even worse. Most prior work in exploiting GPU cache sharing focuses on regular applications that have linear memory acces...

متن کامل

Shear-rotation-warp volume rendering

Shear–warp volume rendering has the advantages of a moderate image quality and a fast rendering speed. However, in the case of dynamic changes in the opacity transfer function, the efficiency of memory access drops, as the method cannot exploit pre-classified volumes. In this paper, we propose an efficient algorithm that exploits the spatial locality of memory references for interactive classif...

متن کامل

Zippy: A Framework for Computation and Visualization on a GPU Cluster

Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large scale general-purpose computation and visualization applications. However, the programming model for high performance generalpurpose computation on GPU clusters remains a complex problem. In this paper, we introduce the Zippy framework, a general and scalable solution to this problem. It abstracts the GPU ...

متن کامل

GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large And Detailed Scenes

In this thesis, we present a new approach to efficiently render large scenes and detailed objects in realtime. Our approach is based on a new volumetric pre-filtered geometry representation and an associated voxel-based approximate cone tracing that allows an accurate and high performance rendering with high quality filtering of highly detailed geometry. In order to bring this voxel representat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Parallel Computing

دوره 40  شماره 

صفحات  -

تاریخ انتشار 2014